Toward self-documenting, customizable microbiome analysis and metaanalysis with scikit-bio, QIIME 2 and Qiita.

Greg Caporaso and Rob Knight

Caporaso Lab, Northern Arizona University

Knight Lab, University of California San Diego

QIIME

  • Cited 2969 times (as of 16 July 2015) since publication.
  • Nearly monthly workshops, somewhere in the world.
  • The next generation of QIIME: scikit-bio, QIIME 2 and Qiita.
    • Simplified installation.
    • Browser-based interface will replace/supplement the command line interface.
    • Workflow transparency.
    • Plugin framework, so other developers can build on QIIME.

scikit-bio

Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.

Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at readIAB.org.

QIIME 2

A stable and well-documented API and command line interface.

   

scikit-bio

Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.

Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at readIAB.org.

Qiita

End-user, browser-based environment for microbiome analysis and meta-analysis.

   

QIIME 2

A stable and well-documented API and command line interface.

   

scikit-bio

Low-level, a bioinformatics library for data scientists, students, educators, and bioinformatics software developers.

Free, Sloan-funded companion text, An Introduction to Applied Bioinformatics available at [readIAB.org

A scikit is a python scientific computing library that cleanly integrates with others.

scikit-bio is built on top of scipy, numpy, pandas, etc.

This enables:

  • beautiful visualizations with seaborn
  • powerful machine learning and optimization with scikit-learn
  • fast sequence collapsing with marisa-trie and related packages
  • statistical modeling with Stats.Models
  • whatever comes next...

scikit-bio 0.4.0, our first beta release, is now available

Developing your bioinformatics software with scikit-bio gets you a lot of functionality for free.

See Jai Rideout and Evan Bolyen's SciPy 2015 talk for more detail and cool demos http://bit.ly/skbio-scipy2015

(We'll tweet the link with #microbenet.)

Example: building a better taxonomy classifier in under 100 lines of code


In [ ]:
from skbio import DNA
import numpy as np
import skbio

## Load Greengenes and slice all sequences to the region 
## amplified by 515F/806R. 

aligned_seqs_fp = 'data/gg_13_8_otus/rep_set_aligned/82_otus.fasta'
taxonomy_fp = 'data/gg_13_8_otus/taxonomy/82_otu_taxonomy.txt'


fwd_primer = DNA("GTGCCAGCMGCCGCGGTAA",
                 metadata={'label':'fwd-primer'})
rev_primer = DNA("GGACTACHVGGGTWTCTAAT",
                 metadata={'label':'rev-primer'}).reverse_complement()

def seq_to_regex(seq):
    result = []
    for base in str(seq):
        if base in DNA.degenerate_chars:
            result.append('[{0}]'.format(
                ''.join(DNA.degenerate_map[base])))
        else:
            result.append(base)

    return ''.join(result)

regex = '({0}.*{1})'.format(seq_to_regex(fwd_primer),
                            seq_to_regex(rev_primer))

starts = []
stops = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta', 
                         constructor=DNA):
    for match in seq.find_with_regex(regex, ignore=seq.gaps()):
        starts.append(match.start)
        stops.append(match.stop)
        
locus = slice(int(np.median(starts)), int(np.median(stops)))

## Get all kmers for all sequences.
kmer_counts = []
seq_ids = []
for seq in skbio.io.read(aligned_seqs_fp, format='fasta',
                         constructor=DNA):
    seq_ids.append(seq.metadata['id'])
    sliced_seq = seq[locus].degap()
    kmer_counts.append(sliced_seq.kmer_frequencies(8))
    
    
## Load the training/test data, perform feature selection, and train 
## and test the classifier using cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectPercentile
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

X = DictVectorizer().fit_transform(kmer_counts)

taxonomy_level = 3 # class
id_to_taxon = {}
with open(taxonomy_fp) as f:
    for line in f:
       id_, taxon = line.strip().split('\t')
       id_to_taxon[id_] = '; '.join(taxon.split('; ')[:taxonomy_level])

y = [id_to_taxon[seq_id] for seq_id in seq_ids]

X = SelectPercentile().fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
y_pred = SVC(C=10, kernel='linear', degree=3,
             gamma=0.001).fit(X_train, y_train).predict(X_test)

## Define a function to use for plotting. 
%matplotlib inline
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.ylabel('Known taxonomy')
    plt.xlabel('Predicted taxonomy')
    plt.tight_layout()
    plt.show()

Support Vector Machine genus classifier

Naive Bayes genus classifier

QIIME 2

Self-documenting analyses simplify bioinformatics methods reporting, and therefore reproducibility and transparancy.